working directory, packages, functions, global variables.
The purpose of this analysis is to link dengue incidence data from OpenDengue with spatiotemporal environmental data published in Siraj et al. (2018). Unfortunately, these two sources use two different identifier systems for municipalities. Siraj et al. (2018) uses identifiers provided by the Colombian government, while the OpenDengue database uses FAO GAUL identifiers. I was not able to find a resource that links these two identifier systems directly. However, I was able to retrieve shape files containing the FAU GAUL identifiers (corresponding author of the OpenDengue database, personal communication), and the Colombian identifier system (United Nations for the coordination of humanitarian affairs, OCHA).
A shape file is a simple, non-topological format for storing the geometric location and attribute information of geographic features. Here, the geographic features represent the boundaries of the municipalities in Colombia. I will us the geographic feature information as an aid to link the two identifier systems.
For the Siraj data, there is only one ID that is not present in the OCHA shape file - “Rea en Litigo” with an ID of 99999. However, there does not seem to be any dengue incidence data for “Rea en Litigo” anyways. For the Open Dengue data, all FAO GAUL IDs can be found in the provided shape file. This is not surprising, because the shape file was provided by the Open Dengue corresponding author.
For the Siraj data, I had to fall back on shape files that have been released after the publication of the data (the original files were not available). To see, whether there is consistency in municipality names, we will match the spatiotemporal environmental data from Siraj et al. (2018) with the OCHA data by ID, and compare the differences in naming with the help of a “string distance score”: if two strings are identical, the score is 0.
Overall, there is good agreement between the municipality names in Siraj et al. and the OCHA data (with the exception of the municipality with the ID 88564). Differences are mostly caused by additional information included in the municipality names in the Siraj at al data. Additionally, special characters are missing in the municipality names in the Siraj et al. data (confirmed by Amir Siraj, personal communication).
ocha_id <- gsub("CO", "",ocha$ADM2_PCODE) #prepare OCHA ID
ocha_id <- as.integer(unique(ocha_id))
env_id <- as.integer(unique(env$ID_ESPACIA)) #prepare Siraj ID
table(env_id %in% ocha_id) #one Siraj ID not present in OCHA
##
## FALSE TRUE
## 1 1122
absent_id <- setdiff(env_id, ocha_id)
env[which(env$ID_ESPACIA == absent_id),] # not present = REA EN LITIGIO
## ID_ESPACIA NOM_MUNICI Wpop2015 WpopBirths2015 UrbanPop MeanGCP_2005USD
## 376 99999 REA EN LITIGIO 3994.193 37.40853 0 3840.21
## MeanTraveltime
## 376 265.7429
table(dengue$adm_2_name[grep("REA", dengue$adm_2_name)]) #not present in incidence data
##
## ASTREA
## 166
dengue_id <- as.integer(unique(dengue$FAO_GAUL_code)) #prepare OpenDengue ID
od_id <- as.integer(unique(od$GAUL_CODE))
table(dengue_id %in% od_id) #all OpenDengue IDs present in Open Dengue shape file
##
## TRUE
## 1007
rm(absent_id, dengue_id, ocha_id, od_id, env_id)
The linkage of spatiotemporal environmental data from Siraj et al. (2018) and dengue incidence data from the OpenDengue database comes with several challenges/caveats: - municipality names are not unique. Two municipalities (admin 2 level) can have the same name, but occur in two different departments (admin 1 level). - Siraj et al. (2018): special characters in the municipality names are missing - The FAO GAUL codes are not uniquely matched with individual municipalities: 2-4x assignments are presents (at a low frequency) too. The plots below demonstrate this problem. pink = region of one FAO GAUL code (OpenDengue shape file); blue = municipalities with same FAO GAUL Code (Ocha shape file). Thus, the shape files and FAO GAUL codes are only of limited use.
fao_gaul_code <- subset(dengue, select = c("FAO_GAUL_code", "full_name", "adm_2_name"))
fao_gaul_code <- fao_gaul_code[!duplicated(fao_gaul_code$full_name),]
print(nrow(fao_gaul_code)) #number municipalities
## [1] 1063
print(length(unique(fao_gaul_code$FAO_GAUL_code))) #number FAO GAUL codes
## [1] 1007
plot_gaul(gaul_id = 13521, od_shape = od, ocha_shape = ocha, dengue_data = dengue)
plot_gaul(gaul_id = 13441, od_shape = od, ocha_shape = ocha, dengue_data = dengue)
plot_gaul(gaul_id = 13516, od_shape = od, ocha_shape = ocha, dengue_data = dengue)
rm(fao_gaul_code)
I match municipality and department names of the OpenDengue incidence data with the names available in the OCHA data. I will not use the Siraj et al. data directly, because it misses special characters. However, we have already established above that overall, the ID-assignment between the two data sets seems to be consistent. While the geographic feature information alone is not sufficient for matching (see examples above), I will use it as a quality control for incidences where the names do not match perfectly.
As a first step, we will match IDs, where both the standardized ( = lower case and latin ascii) municipality name and department name matches perfectly between the OCHA and the Siraj data (894 out of 1063 cases).
#STEP 0: Generate variables for tracking ----
tobeLinked_ID <- unique(dengue_link$ID) #OpenDengue IDs to be linked
tobeLinked_df <- link_df #Data frame with OpenDeuge IDs ~ OCHA IDs link
# STEP 1: Perfect Match ----
id <- which(link_df$adm1Score == 0 & link_df$adm2Score == 0)
length(unique(link_df$dengueID[id])) == nrow(link_df[id,]) #sanity check: all unique matches
## [1] TRUE
final_link <- tobeLinked_df[id, ]
#adjust remaining municipalities
tobeLinked_ID <- tobeLinked_ID[-which(tobeLinked_ID %in% final_link$dengueID)]
tobeLinked_df <- tobeLinked_df[-which(tobeLinked_df$dengueID %in% final_link$dengueID),]
table(tobeLinked_df$adm2Score, tobeLinked_df$adm1Score) #remaining Scores
##
## 0 3 4 5 6 7 8 9 10 15
## 0 0 55 4 1 1 3 3 1 33 0
## 1 11 0 0 0 0 0 0 0 0 0
## 2 3 2 3 0 5 0 0 0 0 0
## 3 6 1 1 0 1 1 0 0 0 0
## 4 3 0 0 1 2 1 1 1 0 0
## 7 0 0 0 1 0 0 0 0 0 0
## 8 2 0 0 2 1 0 0 0 0 0
## 9 2 0 2 0 1 2 0 0 0 0
## 10 2 0 0 0 0 0 1 0 0 0
## 11 2 0 0 3 1 0 0 0 0 1
## 12 1 0 0 0 1 0 0 1 0 0
## 13 3 0 1 1 0 0 0 0 0 0
## 14 2 0 0 0 0 0 0 0 0 0
## 15 2 0 0 0 0 0 0 0 0 0
## 16 1 0 0 0 0 0 0 0 0 0
## 17 1 0 0 0 0 0 0 0 0 0
Next, we will look at scenarios where the municipality name is a perfect match, but there are mismatches in the department names.
The majority of these cases is caused by the departments norte de santander (OCHA)/norte santander (Open Dengue); valle del cauca (OCHA)/valle (Open Dengue); la guajira (OCHA)/guarija (Open Dengue). Visual inspection of the geographic feature information suggest that these cases match. Thus, we include them (981 out of 1063 matched).
#STEP 2: Perfect Munic. Score, Difference in Dep. ----
id <- which(tobeLinked_df$adm2Score == 0 & tobeLinked_df$adm1Score > 0)
table(tobeLinked_df$adm1_std[id])
##
## atlantico boyaca cauca la guajira
## 1 1 3 15
## meta narino norte de santander risaralda
## 1 1 39 1
## santander sucre valle del cauca
## 3 3 33
#norte de santander (OCHA) = norte santander (Open Dengue)
plot_gaul_byIDs(od_id = 14133, od_shape = od, ocha_id = "CO54206", ocha_shape = ocha)
#valle del cauca (OCHA) = valle (Open Dengue)
plot_gaul_byIDs(od_id = 13964, od_shape = od, ocha_id = "CO44090", ocha_shape = ocha)
#la guajira (OCHA) = guarija (Open Dengue )
plot_gaul_byIDs(od_id = 14378, od_shape = od, ocha_id = "CO76233", ocha_shape = ocha)
#STEP 2a: Dep. Santander, Valle, Guajira ----
id <- which(tobeLinked_df$adm2Score == 0
& tobeLinked_df$adm1_std %in%
c("norte de santander", "valle del cauca", "la guajira"))
#sanity check: only diff. in adm1 level
table(tobeLinked_df$adm2Score[id], tobeLinked_df$adm1Score[id])
##
## 3 10
## 0 54 33
#sanity check: constant difference in adm1 level
table(tobeLinked_df$adm1Score[id], tobeLinked_df$adm1_std[id])
##
## la guajira norte de santander valle del cauca
## 3 15 39 0
## 10 0 0 33
#sanity check: only unique matches
length(unique(tobeLinked_df$dengueID[id])) == length(id)
## [1] TRUE
final_link <- rbind(final_link, tobeLinked_df[id, ])
#adjust remaining municipalities
tobeLinked_df <- tobeLinked_df[-which(tobeLinked_df$dengueID %in% final_link$dengueID),]
tobeLinked_ID <- tobeLinked_ID[-which(tobeLinked_ID %in% final_link$dengueID)]
table(tobeLinked_df$adm2Score, tobeLinked_df$adm1Score) #remaining Scores
##
## 0 3 4 5 6 7 8 9 15
## 0 0 1 4 1 1 3 3 1 0
## 1 11 0 0 0 0 0 0 0 0
## 2 3 2 3 0 5 0 0 0 0
## 3 6 1 1 0 1 1 0 0 0
## 4 3 0 0 1 2 1 1 1 0
## 7 0 0 0 1 0 0 0 0 0
## 8 2 0 0 2 1 0 0 0 0
## 9 2 0 2 0 1 2 0 0 0
## 10 2 0 0 0 0 0 1 0 0
## 11 2 0 0 3 1 0 0 0 1
## 12 1 0 0 0 1 0 0 1 0
## 13 3 0 1 1 0 0 0 0 0
## 14 2 0 0 0 0 0 0 0 0
## 15 2 0 0 0 0 0 0 0 0
## 16 1 0 0 0 0 0 0 0 0
## 17 1 0 0 0 0 0 0 0 0
sum(tobeLinked_df$adm1Score >0 & tobeLinked_df$adm2Score == 0)
## [1] 14
After accounting for the three departments, there are 14 remaining matches that have perfect municipality match, but mismatches in the department names. They are all rather dubious and I did not include them.
Next, we will look at scenarios where the department is a perfect match, but there are mismatches in the municipality names (n = 41). Here, I also used visual inspection of the geographic features whether the match seems correct (i.e., partially similar municipality shape, considerable overlap) or whether it should be discarded (see plots below).
## [1] 41
##
## 2 3 4 7 8 9 10 11 12 13
## 3 2 1 0 0 0 0 0 0 0 0
## 4 3 1 0 0 0 2 0 0 0 1
## 5 0 0 1 1 2 0 0 3 0 1
## 6 5 1 2 0 1 1 0 1 1 0
## 7 0 1 1 0 0 2 0 0 0 0
## 8 0 0 1 0 0 0 1 0 0 0
## 9 0 0 1 0 0 0 0 0 1 0
## 15 0 0 0 0 0 0 0 1 0 0
After these three steps, I matched 1009 out of 1063 municipalities (95 %). The remaining matches were discarded.
head(final_link)
## Simple feature collection with 6 features and 19 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -75.38351 ymin: 3.745484 xmax: -71.94885 ymax: 10.79196
## Geodetic CRS: WGS 84
## Shape_Leng Shape_Area ADM2_ES ADM2_PCODE
## 294 2.0688161 0.077934078 El Carmen de Bolívar CO13244
## 532 2.2326910 0.092943392 Magangué CO13430
## 262 1.8718671 0.096242304 Cubará CO15223
## 745 1.4605427 0.032741165 Purificación CO73585
## 1058 0.5499045 0.008376978 Usiacurí CO08849
## 889 2.4343919 0.091328821 San Vicente de Chucurí CO68689
## ADM2_REF ADM2ALT1ES ADM2ALT2ES ADM1_ES ADM1_PCODE ADM0_ES
## 294 El Carmen de Bolivar <NA> <NA> Bolívar CO13 Colombia
## 532 Magangue <NA> <NA> Bolívar CO13 Colombia
## 262 Cubara <NA> <NA> Boyacá CO15 Colombia
## 745 Purificacion <NA> <NA> Tolima CO73 Colombia
## 1058 Usiacuri <NA> <NA> Atlántico CO08 Colombia
## 889 San Vicente de Chucuri <NA> <NA> Santander CO68 Colombia
## ADM0_PCODE date validOn validTo adm2_std adm1_std
## 294 CO 2018-01-01 2020-04-16 <NA> el carmen de bolivar bolivar
## 532 CO 2018-01-01 2020-04-16 <NA> magangue bolivar
## 262 CO 2018-01-01 2020-04-16 <NA> cubara boyaca
## 745 CO 2018-01-01 2020-04-16 <NA> purificacion tolima
## 1058 CO 2018-01-01 2020-04-16 <NA> usiacuri atlantico
## 889 CO 2018-01-01 2020-04-16 <NA> san vicente de chucuri santander
## adm2Score adm1Score dengueID geometry
## 294 0 0 1 MULTIPOLYGON (((-75.3143 9....
## 532 0 0 2 MULTIPOLYGON (((-74.73933 9...
## 262 0 0 3 MULTIPOLYGON (((-72.17368 7...
## 745 0 0 4 MULTIPOLYGON (((-74.86873 3...
## 1058 0 0 5 MULTIPOLYGON (((-74.97205 1...
## 889 0 0 6 MULTIPOLYGON (((-73.52601 7...
nrow(final_link)/nrow(dengue_link) #able to link 95 % of municipalities.
## [1] 0.9492004
#subset to IDs
final_link <- as.data.frame(subset(final_link, select = c("dengueID", "ADM2_PCODE")))
final_link$geometry <- NULL
#sanity check: any double entries
any(duplicated(final_link$dengueID))
## [1] FALSE
any(duplicated(final_link$ADM2_PCODE))
## [1] FALSE
#store result
res <- merge(dengue_link, final_link, by.x = "ID", by.y = "dengueID")
write.table(res, file = "DD-AML-res-IDLinks.txt",
col.names = T, row.names = F, quote = T, sep = "\t")